To visualize & analyze card usage and car movement data with the employee disappearance incident
This study is based on the Mini-Challenge 2 of the VAST Challenge 2021. In a fiction scenario, there is a natural gas company named “GASTech” operating in the island country if Kronos. The GASTech didn’t do well in environment stewardship. And after an company IPO celebration in January 2014, several employees of GASTech went missing. An environment organization is suspected in the disappearance.
Many of the Abila, Kronos-based employees of GAStech have company cars which are approved for both personal and business use. Those who do not have company cars have the ability to check out company trucks for business use, but these trucks cannot be used for personal business.
Employees with company cars are happy to have these vehicles, because the company cars are generally much higher quality than the cars they would be able to afford otherwise. However, GAStech does not trust their employees. Without the employees? knowledge, GAStech has installed geospatial tracking software in the company vehicles. The vehicles are tracked periodically as long as they are moving.
This vehicle tracking data has been made available to law enforcement to support their investigation. Unfortunately, data is not available for the day the GAStech employees went missing. Data is only available for the two weeks prior to the disappearance.
To promote local businesses, Kronos based companies provide a Kronos Kares benefit card to GASTech employees giving them discounts and rewards in exchange for collecting information about their credit card purchases and preferences as recorded on loyalty cards. This data has been made available to investigators in the hopes that it can help resolve the situation. However, Kronos Kares does not collect personal information beyond purchases.
Use visual analytics to identify which GASTech employees made which purchases and identify suspicious patterns of behavior. Besides, the study must cope with uncertainties that result from missing, conflicting, and imperfect data to make recommendations for further investigation.
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies? Please limit your answer to 8 images and 300 words.
Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find? Please limit your answer to 8 images and 500 words.
Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data? Please limit your answer to 8 images and 500 words.
Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships. Please limit your response to 8 images and 500 words.
Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why Please limit your response to 10 images and 500 words.
The VAST Challenge 2014 has the same scenario with slightly different dataset and questions. The submission repository can be found here.
Various analytic tools were used among the submissions, like JMP, D3 and custom tools. The heatmap and time histograms were useful to represent the numerical value under the combination of one categorical variable and one discrete/categorical variable, such as the usage frequency under different locations and days. Besides, movement line graph with the map background can help to identify and check suspicious activities.
However, almost all graphs were static and readers would find it difficult to explore other parts in graphs which were not specially mentioned by authors. Since the study is displayed on html page, the interactive graphs will be possible. For example, the tooltip function can make every data point to have detailed information without checking the axis or drawing additional graphs. The zoom-in and onclick functions allow readers to check the whole complex graph with too many lines/objects and focus on one part only.
Import packages.
The location names contain some special characters, such as “Café”, which are not recognized by utf-8 encoding. Thus, special encoding is used in reading data.
loyalty <- read_csv("data/loyalty_data.csv", locale=locale(encoding ="windows-1252"))
cc <- read_csv("data/cc_data.csv", locale=locale(encoding ="windows-1252"))
Take a glimpse of credit card data and loyalty card data
knitr::kable(cc[c(0:5),],
caption = "Credit Card Usage Data") %>%
kableExtra::kable_paper("hover", full_width = F)
| timestamp | location | price | last4ccnum |
|---|---|---|---|
| 01/06/2014 07:28 | Brew’ve Been Served | 11.34 | 4795 |
| 01/06/2014 07:34 | Hallowed Grounds | 52.22 | 7108 |
| 01/06/2014 07:35 | Brew’ve Been Served | 8.33 | 6816 |
| 01/06/2014 07:36 | Hallowed Grounds | 16.72 | 9617 |
| 01/06/2014 07:37 | Brew’ve Been Served | 4.24 | 7384 |
knitr::kable(loyalty[c(0:5),],
caption = "Loyalty Card Usage Data") %>%
kableExtra::kable_paper("hover", full_width = F)
| timestamp | location | price | loyaltynum |
|---|---|---|---|
| 01/06/2014 | Brew’ve Been Served | 4.17 | L2247 |
| 01/06/2014 | Brew’ve Been Served | 9.60 | L9406 |
| 01/06/2014 | Hallowed Grounds | 16.53 | L8328 |
| 01/06/2014 | Coffee Shack | 11.51 | L6417 |
| 01/06/2014 | Hallowed Grounds | 12.93 | L1107 |
The timestamp in the credit card usage date (“cc”) contains date and time, while the timestamp in the loyal card usage data (“loyalty”) contains only data. Besides, their data type is string, which will be transformed into datetime type.
And we separate day, hour from the datetime feature.
gps <- read_csv("data/gps.csv")
knitr::kable(gps[c(0:5),],
caption = "GPS Data") %>%
kableExtra::kable_paper("hover", full_width = F)
| Timestamp | id | lat | long |
|---|---|---|---|
| 01/06/2014 06:28:01 | 35 | 36.07623 | 24.87469 |
| 01/06/2014 06:28:01 | 35 | 36.07622 | 24.87460 |
| 01/06/2014 06:28:03 | 35 | 36.07621 | 24.87444 |
| 01/06/2014 06:28:05 | 35 | 36.07622 | 24.87425 |
| 01/06/2014 06:28:06 | 35 | 36.07621 | 24.87417 |
The timestamp in the GPS data also need to be transformed.
And the longitude and latitude are rounded into 5 digits. It can avoid the inconsistent/inaccurate GPS data to some extent. And five decimal places implies 1.11 meters accuracy, which is better than 4 or 6 digits (11.1 meter or 0.11 meter accuracy) under this question scenario.
In the challenge page, it mentioned that the vehicles are tracked periodically as long as they are moving. Thus, the time gap in the GPS data within one car indicates that this car stopped at current GPS location. Stops correspond to local business locations or other locations. To find these business locations, we excluded out the time gap less than 3 minutes, which might be that the car stopped to wait for traffic light.
gps2 <- gps2 %>%
group_by(id) %>%
mutate(end = Timestamp,
start = lag(Timestamp, default = first(Timestamp),
order_by = Timestamp),
diff_mins = difftime(end, start, units = "mins")) %>%
mutate(stop = ifelse(diff_mins >= 3, TRUE, FALSE)) %>%
filter(stop == TRUE) %>%
ungroup()
# rearrange useful features
gps2_stop <- gps2[c(7,6,2,3,4,8,5)]
gps2_stop_sf <- st_as_sf(gps2_stop,
coords = c("long", "lat"), # combine the lo, la
crs = 4326) # 4326 is wgs84 Geographic Coordinate System
The “start” in the “gps2_stop_sf” refers to the time when the car starts parking, while the “end” refers to the time when the car ends parking.
Besides, most vehicles are assigned one-to-one. Only truck drivers are not assigned cars but are allowed to use available truck for business
car_assignments <- read_csv("data/car-assignments.csv")
# check car assignment data
knitr::kable(car_assignments,
caption = "Car assignment") %>%
kableExtra::kable_paper("hover", full_width = F) %>%
kableExtra::scroll_box(height = "300px")
| LastName | FirstName | CarID | CurrentEmploymentType | CurrentEmploymentTitle |
|---|---|---|---|---|
| Calixto | Nils | 1 | Information Technology | IT Helpdesk |
| Azada | Lars | 2 | Engineering | Engineer |
| Balas | Felix | 3 | Engineering | Engineer |
| Barranco | Ingrid | 4 | Executive | SVP/CFO |
| Baza | Isak | 5 | Information Technology | IT Technician |
| Bergen | Linnea | 6 | Information Technology | IT Group Manager |
| Orilla | Elsa | 7 | Engineering | Drill Technician |
| Alcazar | Lucas | 8 | Information Technology | IT Technician |
| Cazar | Gustav | 9 | Engineering | Drill Technician |
| Campo-Corrente | Ada | 10 | Executive | SVP/CIO |
| Calzas | Axel | 11 | Engineering | Hydraulic Technician |
| Cocinaro | Hideki | 12 | Security | Site Control |
| Ferro | Inga | 13 | Security | Site Control |
| Dedos | Lidelse | 14 | Engineering | Engineering Group Manager |
| Bodrogi | Loreto | 15 | Security | Site Control |
| Vann | Isia | 16 | Security | Perimeter Control |
| Flecha | Sven | 17 | Information Technology | IT Technician |
| Frente | Birgitta | 18 | Engineering | Geologist |
| Frente | Vira | 19 | Engineering | Hydraulic Technician |
| Fusil | Stenig | 20 | Security | Building Control |
| Osvaldo | Hennie | 21 | Security | Perimeter Control |
| Nubarron | Adra | 22 | Security | Badging Office |
| Lagos | Varja | 23 | Security | Badging Office |
| Mies | Minke | 24 | Security | Perimeter Control |
| Herrero | Kanon | 25 | Engineering | Geologist |
| Onda | Marin | 26 | Engineering | Drill Site Manager |
| Orilla | Kare | 27 | Engineering | Drill Technician |
| Borrasca | Isande | 28 | Engineering | Drill Technician |
| Ovan | Bertrand | 29 | Facilities | Facilities Group Manager |
| Resumir | Felix | 30 | Security | Security Group Manager |
| Sanjorge Jr. | Sten | 31 | Executive | President/CEO |
| Strum | Orhan | 32 | Executive | SVP/COO |
| Tempestad | Brand | 33 | Engineering | Drill Technician |
| Vann | Edvard | 34 | Security | Perimeter Control |
| Vasco-Pais | Willem | 35 | Executive | Environmental Safety Advisor |
| Hafon | Albina | NA | Facilities | Truck Driver |
| Hawelon | Benito | NA | Facilities | Truck Driver |
| Hawelon | Claudio | NA | Facilities | Truck Driver |
| Mies | Henk | NA | Facilities | Truck Driver |
| Morlun | Valeria | NA | Facilities | Truck Driver |
| Morlun | Adan | NA | Facilities | Truck Driver |
| Morluniau | Cecilia | NA | Facilities | Truck Driver |
| Nant | Irene | NA | Facilities | Truck Driver |
| Scozzese | Dylan | NA | Facilities | Truck Driver |
gps2_stop_sf <- left_join(gps2_stop_sf,
car_assignments, by = c("id" = "CarID"))
Lastly, we also need to draw the car movement path on the map. It requires the GPS data to be coordinate formats and one path is actually one line string with multiple GPS points.
# convert values from numerical to factor data type
gps$day <- as.factor(gps$day)
gps$id <- as_factor(gps$id)
gps_sf <- st_as_sf(gps,
coords = c("long", "lat"),
crs = 4326)
# group car paths
gps_path <- gps_sf %>%
group_by(id, day) %>%
summarize(m =mean(Timestamp),
do_union=FALSE) %>%
st_cast("LINESTRING")
The tourist map provided is not georeferenced. And QGIS can help to georeference an image with the ESRI shapefiles (geospatial vector data) of the city.
The process includes:
After the process, we will get a tif file which is a combination of tourist map and georeferenced road map. Then we can plot car movements line with longitude and latitude data on the map.
we need to import the tif file generated by QGIS and display the map.
bgmap <- raster("data/Geospatial/MC2-tourist.tif")
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?
To identify popularity, we can calculate the card usage frequency and amount in every locations of different days and hours.
Firstly, let’s plot the frequency of cards in the 14 days. We need to calculate the card usage frequency in different days, convert into data frame, draw their heatmaps and plot together.
# calculate the frequency data frame of credit and loyalty card usage
cc_freq_day <- as.data.frame(xtabs(~location+day, data = cc))
loyalty_freq_day <- as.data.frame(xtabs(~location+day, data = loyalty))
# join the two frequency data frame
freq_day_join <- full_join(cc_freq_day,loyalty_freq_day,by= c("location","day"))
names(freq_day_join) <- c("location","day","CC_Freq","Loyalty_Freq")
# transfer from factors to numeric with original values
freq_day_join$day <- as.numeric(levels(freq_day_join$day))[freq_day_join$day]
# plot the heatmap of credit card usage frequency
p1 <- ggplot(freq_day_join,aes(x=day,y=location))+
geom_tile(aes(fill=CC_Freq))+
scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.title=element_blank())
# plot the heatmap of loyalty card usage frequency
p2 <- ggplot(freq_day_join,aes(x=day,y=location))+
geom_tile(aes(fill=Loyalty_Freq))+
scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.title=element_blank())
# convert static graph into interactive
plotly::subplot(ggplotly(p1),
ggplotly(p2),
shareY = TRUE)
Figure 1: Daily Frequency of Credit (left) and Loyalty (right) Card Usage
From the card usage frequency (or consumption frequency), we can easily identify that “Katerina’s Café”, “Hippokampos” and “Brew’ve Been Served” are the most popular with almost all squares in deeper color, where the daily consumption frequency is above 10. “Hallowed Grounds” and “Guy’s Gyros” are slightly less popular.
Besides, we can find that “Brew’ve Been Served” and “Hallowed Grounds” are popular every day except weekends (day 11-12, 18-19). The frequency are 0 on weekends, which might because the location is closed on weekends. It’s the same to “Hallowed Grounds”.
On weekends, “Katerina’s Café” and “Hippokampos” are the most popular while other locations might be closed or less consumption these days.
As for anomalies, we can see there is one white line in the graph for loyalty card, corresponding to “Daily Dealz”. This location only have one credit card consumption record on day 13 and no loyalty card record among the two weeks.
The daily frequencies are the same between “Maximum Iron and Steel” and “Kronos Pipe and Irrigation” every day in the two weeks.
To correct these anomalies, we can check the GPS data to make sure who made the only one consumption in “Daily Dealz”. If there were no anomalies after checking, we can just delete this single record in the credit card data. And for the situation between “Maximum Iron and Steel” and “Kronos Pipe and Irrigation”, it’s just coincidence after checking the consumption amount.
Secondly, we can plot the consumption amount instead of frequency. The steps are almost the same.
cc_price_matrix <- tapply(cc$price,cc[,c("location","day")],sum)
cc_price <- reshape2::melt(cc_price_matrix)
loyalty_price_matrix <- tapply(loyalty$price,loyalty[,c("location","day")],sum)
loyalty_price <- reshape2::melt(loyalty_price_matrix)
price_day_join <- full_join(cc_price,loyalty_price,by= c("location","day"))
names(price_day_join) <- c("location","day","Price.","Price")
p1_price <- ggplot(price_day_join,aes(x=day,y=location))+
geom_tile(aes(fill=Price.))+
scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.title=element_blank())
p2_price <- ggplot(price_day_join,aes(x=day,y=location))+
geom_tile(aes(fill=Price))+
scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.title=element_blank())
plotly::subplot(ggplotly(p1_price),
ggplotly(p2_price),
shareY = TRUE)
Figure 2: Daily Consumption Amount of Credit and Loyalty Card
The consumption amount differences among locations are much bigger than frequency differences.
Apparently, “Abila Airport” are the place where has the biggest consumption amount. And these consumption occurred on weekdays only.
Besides, “Stewart and Sons Fabrication”, “Nationwide Refinery” and “Abila Airport” also have high consumption amounts on weekdays. All these locations don’t show high frequency values in previous graphs but have very high daily consumption amounts.
And there are many outliers which might be anomalies. For example, “Frydos Autosupply n’ More” had a daily cc consumption amount ($10455.22) in day 13, which is several times as much as those in other days. And the loyalty consumption amount in day 8 at “Nationwide Refinery” is 12554.91.
What’s more, there are many inconsistencies between amounts in the credit card record and loyalty card record. At “Stewart and Sons Fabrication”, the daily amounts from day 13 to day 16 don’t match in two graphs.
To correct these anomalies, we need to check through the car movement data where the consumption amount outliers exist. It’s to see whether there are activities or other gathering to cause the high consumption. As for the inconsistency in amounts, the possible explanations are there might be someone used only one of the two cards or got cashback in the consumption.
Lastly, we change the time unit from days to hours to analyze the popular locations. Only the timestamp of credit card data contains time, so there are no hourly heatmaps for loyalty card usage.
cc_freq_hour <- as.data.frame(xtabs(~location+hour, data = cc))
# convert factor into number
cc_freq_hour$hour <- as.numeric(levels(cc_freq_hour$hour))[cc_freq_hour$hour]
cc_price_hour_matrix <- tapply(cc$price,cc[,c("location","hour")],sum)
cc_price_hour <- reshape2::melt(cc_price_hour_matrix)
cc_hour_join <- full_join(cc_freq_hour, cc_price_hour, by= c("location","hour"))
names(cc_hour_join) <- c("location","hour","Freq","Amount")
p3_freq <- ggplot(cc_hour_join,aes(x=hour,y=location))+
geom_tile(aes(fill=Freq))+
scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
p3_price <- ggplot(cc_hour_join,aes(x=hour,y=location))+
geom_tile(aes(fill=Amount))+
scale_fill_gradient(low = "#deeff7", high = "#0D2330")+
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank())
plotly::subplot(ggplotly(p3_freq),
ggplotly(p3_price),
shareY = TRUE) %>%
hide_colorbar()
Figure 3: Hourly Consumption Frequency and Amount of Credit Card
From the left hourly heatmap, we can easily identify the popular period for each locations since there are clear pattern.
And some anomalies exist in the strange time period. At 3am, there are 5 credit card usages in “Kronos Mart”. For “Daily Dealz”, the only credit card transaction happened at 6am.
The right heatmaps also show anomalies: A very high consumption amount ($32419.63) happened at 11am of “Stewart and Sons Fabrication”.
Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?
Firstly, filter out the credit card consumption record at “Frydos Autosupply n’ More” on day 13.
knitr::kable(cc %>%
filter(day == 13 & location == "Frydos Autosupply n' More"),
caption = "Consumption record at Frydos Autosupply n' More on day 13") %>%
kableExtra::kable_paper("hover", full_width = F)
| timestamp | location | price | last4ccnum | date | day | hour |
|---|---|---|---|---|---|---|
| 2014-01-13 19:20:00 | Frydos Autosupply n’ More | 10000.00 | 9551 | 2014-01-13 | 13 | 19 |
| 2014-01-13 19:41:00 | Frydos Autosupply n’ More | 188.57 | 8129 | 2014-01-13 | 13 | 19 |
| 2014-01-13 19:59:00 | Frydos Autosupply n’ More | 64.60 | 8411 | 2014-01-13 | 13 | 19 |
| 2014-01-13 21:11:00 | Frydos Autosupply n’ More | 202.05 | 2418 | 2014-01-13 | 13 | 21 |
The abnormal consumption is from the cc number 9551. Let’s check the consumption records of this cc on day 13.
knitr::kable(cc %>%
filter(day == 13 & last4ccnum == 9551),
caption = "Consumption record of cc 9551 owner on day 13") %>%
kableExtra::kable_paper("hover", full_width = F)
| timestamp | location | price | last4ccnum | date | day | hour |
|---|---|---|---|---|---|---|
| 2014-01-13 06:04:00 | Daily Dealz | 2.01 | 9551 | 2014-01-13 | 13 | 6 |
| 2014-01-13 13:18:00 | U-Pump | 55.25 | 9551 | 2014-01-13 | 13 | 13 |
| 2014-01-13 13:28:00 | Hippokampos | 30.51 | 9551 | 2014-01-13 | 13 | 13 |
| 2014-01-13 19:20:00 | Frydos Autosupply n’ More | 10000.00 | 9551 | 2014-01-13 | 13 | 19 |
| 2014-01-13 19:30:00 | Ouzeri Elian | 28.75 | 9551 | 2014-01-13 | 13 | 19 |
We can see the cc owner make the only only transaction at “Daily Dealz” at early morning (6am), which is the only one transaction in the two weeks.
Besides, “U-Pump” is a special place because there were only two consumption records in the two weeks, which can be found in Figure 3 and Figure 1. Therefore, there should have fewer stop locations near U-Pump in the car GPS data.
We can check the stop locations on day 13. On this day, there was one point near U-Pump where the stop time is near the consumption time in “U-Pump”. The corresponding car id is 24.
Thus, we think that Minke, the owner of car 24, use the credit card 9551. Let’s draw the moving path of this car to discover more. All elements are draw in the same plot to enhance understanding
gps2_stop_day13 <- gps2_stop_sf %>%
filter(day ==13)
gps2_stop_car24_day13 <- gps2_stop_sf %>%
filter(day ==13 & id == 24)
gps_path_car24_day13 <- gps_path %>%
filter(day == 13 & id == 24)
map1 <- tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_car24_day13) +
tm_lines(col = "blue") +
tm_shape(gps2_stop_day13) +
tm_dots() +
tm_shape(gps2_stop_car24_day13) +
tm_dots(col = "blue", size = 0.1)
tmap_leaflet(map1)
Figure 4: Stop locations and driving path of car 24 on day 13
Hovering over blue dots, we can see the stop locations of car 24. On day 13, the car started running at about 7pm from home (the east area in the map) and stopped at “Katerina’s Café” (the south-east area) for half an hour. Then, the car stopped near “Albert’s Fine Clothing” at around noon (the north-west area).
After that, the car stopped near “U-Pump” (the center area) from 12:35 to 13:22. The purchase time in “U-Pump”, 13:18:00, matches the time period.
From 13:27 to 17:57, the car stopped at the GASTech company (south area), which could be the employee was working.
After the work, the car stopped near the “Brew’ve Been Served” (the south-east area) from 18:00 to 19:29. The high consumption occured in this period. The stop location is also close to the “Frydos Autosupply n’ More”. So the driver might stopped the car and walked to the “Frydos Autosupply n’ More” to make the consumption.
There are strange things.
The consumption at “Daily Dealz” occurred at 06:04:00, while the car left home at 07:32:01. It’s strange that the purchase happedned so early and the location can’t be found in other records
The consumption at “Hippokampos” occurred at 13:28:00, while the car stopped at the company at 13:27:14. The time gap is about 30 seconds
The consumption at “Ouzeri Elian” occurred at 19:30:00, while the car left the “Frydos Autosupply n’ More” at 19:29:01. The time gap is just 30 seconds after the car left
We can check the consumption from the combination of credit and loyalty cards data. We use left join to find the corresponding records in the loyalty data.
knitr::kable(cc %>%
filter(day == 13 & last4ccnum == 9551) %>%
left_join(loyalty, by = c("location", "day", "price")),
caption = "Consumption record of cc 9551 with corresponding loyalty records on day 13") %>%
kableExtra::kable_paper("hover", full_width = F)
| timestamp.x | location | price | last4ccnum | date | day | hour | timestamp.y | loyaltynum |
|---|---|---|---|---|---|---|---|---|
| 2014-01-13 06:04:00 | Daily Dealz | 2.01 | 9551 | 2014-01-13 | 13 | 6 | NA | NA |
| 2014-01-13 13:18:00 | U-Pump | 55.25 | 9551 | 2014-01-13 | 13 | 13 | NA | NA |
| 2014-01-13 13:28:00 | Hippokampos | 30.51 | 9551 | 2014-01-13 | 13 | 13 | 2014-01-13 | L5777 |
| 2014-01-13 19:20:00 | Frydos Autosupply n’ More | 10000.00 | 9551 | 2014-01-13 | 13 | 19 | NA | NA |
| 2014-01-13 19:30:00 | Ouzeri Elian | 28.75 | 9551 | 2014-01-13 | 13 | 19 | 2014-01-13 | L5777 |
We can see that the two consumption records, which have little time gap with the car leaving/stopping, exactly have corresponding loyalty card usage. But the other three consumption records should be less rush but didn’t use loyalty card. One possible explaination might be the card stealing. This suspicious activity need to be analyzed further in question 5.
The first step is to find the corresponding records.
knitr::kable(cc %>%
filter(location == "Kronos Mart"),
caption = "Consumption at Kronos Mart") %>%
kableExtra::kable_paper("hover", full_width = F)
| timestamp | location | price | last4ccnum | date | day | hour |
|---|---|---|---|---|---|---|
| 2014-01-10 09:30:00 | Kronos Mart | 203.91 | 7688 | 2014-01-10 | 10 | 9 |
| 2014-01-12 03:39:00 | Kronos Mart | 277.26 | 8156 | 2014-01-12 | 12 | 3 |
| 2014-01-13 03:00:00 | Kronos Mart | 147.30 | 5407 | 2014-01-13 | 13 | 3 |
| 2014-01-13 08:01:00 | Kronos Mart | 159.06 | 6816 | 2014-01-13 | 13 | 8 |
| 2014-01-14 08:20:00 | Kronos Mart | 58.85 | 6899 | 2014-01-14 | 14 | 8 |
| 2014-01-16 07:30:00 | Kronos Mart | 298.83 | 7108 | 2014-01-16 | 16 | 7 |
| 2014-01-17 08:08:00 | Kronos Mart | 286.24 | 1415 | 2014-01-17 | 17 | 8 |
| 2014-01-19 03:13:00 | Kronos Mart | 87.66 | 3484 | 2014-01-19 | 19 | 3 |
| 2014-01-19 03:45:00 | Kronos Mart | 194.51 | 9551 | 2014-01-19 | 19 | 3 |
| 2014-01-19 03:48:00 | Kronos Mart | 150.36 | 8332 | 2014-01-19 | 19 | 3 |
The strange consumption records are the last 3 rows, which occurred at 3 o’clock on day 19 by the owner of credit cards 3484, 9551, 8332.
Coincidentally, credit card 9551 also appeared in the Anomaly 1.
Day 19 is one day before the employee missing incident. We can check the car stop points in the recent one week to find the reason or any anomalies.
gps2_stop_days <- gps2_stop_sf %>%
filter(between(day,13,18))
gps2_stop_day19 <- gps2_stop_sf %>%
filter(day == 19)
map2 <- tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps2_stop_days) +
tm_dots(size = 0.1, alpha = 0.5) +
tm_shape(gps2_stop_day19) +
tm_dots(col = "red", size = 0.1, alpha = 0.5)
tmap_leaflet(map2)
Figure 5: Stop locations from day 13 to day 19
The “Kronos Mart” is located at the west direction with a red symbol. After zooming the map, we can see there were no car stop location near the mart On day 19 (red dot). And several closer red dot, which located at “Roberts and Sons”, were in the afternoon (stop period within 13 to 14 o’clock).
But there are three black dot which are very close to the “Kronos Mart”. Furthermore, The three car stop all started at about 13:30 and ends at about 16:00 on day 18.
The owners of the three cars are listed.
gps2_stop_days %>%
filter((id == 1 | id == 10 |id == 23)
& day == 18
& start > "2014-01-18 13:00:00"
& end < "2014-01-18 16:00:00") %>%
knitr::kable(caption = "The three car stop near Kronos Mart") %>%
kableExtra::kable_paper("hover", full_width=T)
| start | end | id | diff_mins | day | geometry | LastName | FirstName | CurrentEmploymentType | CurrentEmploymentTitle |
|---|---|---|---|---|---|---|---|---|---|
| 2014-01-18 13:48:01 | 2014-01-18 15:14:01 | 23 | 86.0 mins | 18 | POINT (24.8498 36.06586) | Lagos | Varja | Security | Badging Office |
| 2014-01-18 13:29:31 | 2014-01-18 15:52:01 | 10 | 142.5 mins | 18 | POINT (24.84983 36.06588) | Campo-Corrente | Ada | Executive | SVP/CIO |
| 2014-01-18 13:36:43 | 2014-01-18 15:58:01 | 1 | 141.3 mins | 18 | POINT (24.84982 36.06582) | Calixto | Nils | Information Technology | IT Helpdesk |
We can’t get insights from the car owner information since they belong to different employment type. But they stayed at the same location for similar time period. They are very likely to meet each other and do the same thing. Besides, the consumption at 3 o’clock came from 3 credit cards and this meetup also involved in 3 persons.
Thus, one possible explanation of the consumption at mid-night could be that the three car owners came to discuss some plans on day 18 and met again at 3 o’clock on day 19.
Another possible explanation direction could be persons just stayed near the mart, so they don’t need to drive and walked there to make consumption. Or the three person use other vehicles, not from the company, to reach the mart and make consumption.
This suspicious activities will be analyzed further in question 5. It might need to check the behaviors of the three car owners in the 14 days.
cc_num <- length(unique(cc$last4ccnum))
loyalty_num <- length(unique(loyalty$loyaltynum))
ppl_num <- length(car_assignments$LastName)
c(cc_num, loyalty_num, ppl_num)
[1] 55 54 44
We can find that there are 44 employees, but 55 credit cards and 54 loyalty cards. If we suppose no errors in the card id, one employee has one or more credit cards and one or more loyalty cards.
Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?
If one consumption time fall with one car stop period, we believe it’s possible one correspondence. There might be several different purchase in one car stop period, but we can infer that the most common pair within one group is the most likely true pair.
The credit card data contain specific time, but loyalty card data doesn’t. Thus, We will find the relationship between credit cards and loyalty cards. After that, we will match the credit card and car stop. The owners of loyalty card can be inferred from the relationship between credit card and car stop and the relationship between credit cards and loyalty cards.
We full join the two card table by matching day, location and price. It’s seldom that two different consumption will have the same values in the three features. Then, we use group_by() to find all pairs of two cards and count the consumption frequency of the pair.
There will be some rows which can’t match. This might be someone used only one of the two cards or got cashback. We filter out these situations, where the card pair contains null value.
# # make a full join
card_correspond_count <- full_join(cc, loyalty,
by = c("day", "location", "price")) %>%
# calculate frequency
group_by(last4ccnum, loyaltynum) %>%
summarise(count = n()) %>%
# filter out mismatch
drop_na()
# convert 'last4ccnum' into string to plot
card_correspond_count$last4ccnum <- as.character(card_correspond_count$last4ccnum)
Most pairs are one-on-one. It’s confident to conclude there pairs are true (credit and loyalty card in each pair belong to one owner).
card_correspond_count_one2one <- card_correspond_count %>%
filter((n_distinct(last4ccnum)==1 & n_distinct(loyaltynum)==1))
knitr::kable(card_correspond_count_one2one,
caption = "One-on-one matched pairs") %>%
kableExtra::kable_paper("hover", full_width = F) %>%
kableExtra::scroll_box(height = "300px")
| last4ccnum | loyaltynum | count |
|---|---|---|
| 1310 | L8012 | 21 |
| 1321 | L4149 | 22 |
| 1415 | L7783 | 24 |
| 1874 | L4424 | 25 |
| 1877 | L3014 | 18 |
| 2142 | L9637 | 25 |
| 2276 | L3317 | 10 |
| 2418 | L9018 | 20 |
| 2463 | L6886 | 23 |
| 2540 | L5947 | 20 |
| 2681 | L1107 | 20 |
| 3484 | L2490 | 24 |
| 3492 | L7814 | 22 |
| 3506 | L7761 | 6 |
| 3547 | L9362 | 14 |
| 3853 | L1485 | 22 |
| 4434 | L2169 | 26 |
| 4530 | L8477 | 10 |
| 5010 | L2459 | 5 |
| 5407 | L4034 | 20 |
| 6691 | L6267 | 20 |
| 6816 | L8148 | 20 |
| 6895 | L3366 | 21 |
| 6899 | L6267 | 23 |
| 6901 | L9363 | 28 |
| 7108 | L6544 | 16 |
| 7117 | L6417 | 28 |
| 7253 | L1682 | 24 |
| 7354 | L9254 | 21 |
| 7384 | L3800 | 26 |
| 7688 | L4164 | 22 |
| 7792 | L5756 | 20 |
| 7819 | L5259 | 27 |
| 8129 | L8328 | 21 |
| 8156 | L5224 | 22 |
| 8202 | L2343 | 25 |
| 8411 | L6110 | 27 |
| 8642 | L2769 | 12 |
| 9152 | L5485 | 2 |
| 9220 | L4063 | 9 |
| 9241 | L3288 | 13 |
| 9405 | L3259 | 21 |
| 9551 | L5777 | 12 |
| 9614 | L5924 | 2 |
| 9617 | L5553 | 26 |
| 9635 | L3191 | 24 |
| 9683 | L7291 | 18 |
| 9735 | L9633 | 16 |
For those who have matched but not one-on-one, we can plot their parallel graph.
# filter multiple match
card_correspond_count_others <- card_correspond_count %>%
filter(n_distinct(last4ccnum)>1 | n_distinct(loyaltynum)>1)
card_correspond_count_others_plot <- card_correspond_count_others %>%
gather_set_data(1:2) %>% # <- ggforce helper function
arrange(x,last4ccnum,desc(loyaltynum))
# plot
ggplot(card_correspond_count_others_plot,
aes(x = x, id = id, split = y, value = count)) +
geom_parallel_sets(aes(fill = last4ccnum), alpha = 0.7,
axis.width = 0.2, n=100, strength = 0.5) +
geom_parallel_sets_axes(axis.width = 0.25, fill = "gray95",
color = "gray80", size = 0.15) +
geom_parallel_sets_labels(colour = 'gray35', size = 4.5,
angle = 0, fontface="bold") +
theme_minimal() +
theme(
legend.position = "none",
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.y = element_blank(),
axis.text.x = element_text(size = 12, face = "bold"),
axis.title.x = element_blank()
)
Figure 6: Multiple matched pairs
We can infer that the pair with a wider line is the true pair. For example, credit card 4795 (Blue line) matches both L2070 and L8566 loyalty cards, but we can make sure that L8566 is the true pair. Because the the consumption count by L8566, which is displayed as the line width, is much higher than L2070.
knitr::kable(card_correspond_count_others,
caption = "Other matched pairs") %>%
kableExtra::kable_paper("hover", full_width = F) %>%
kableExtra::scroll_box(height = "300px")
| last4ccnum | loyaltynum | count |
|---|---|---|
| 1286 | L3288 | 15 |
| 1286 | L3572 | 13 |
| 4795 | L2070 | 1 |
| 4795 | L8566 | 25 |
| 4948 | L3295 | 1 |
| 4948 | L9406 | 22 |
| 5368 | L2247 | 24 |
| 5368 | L6119 | 1 |
| 5921 | L3295 | 12 |
| 5921 | L9406 | 1 |
| 7889 | L2247 | 1 |
| 7889 | L6119 | 20 |
| 8332 | L2070 | 27 |
| 8332 | L8566 | 1 |
These matched pairs with 1 count might contain some suspicious activities. And we can assign the rows with over 5 count to be true pairs.
In the final predicted card pairs, only credit cards 1286 correspond to multiple loyalty cards (L3288, L3572), which can be found in Figure 6
To match credit card consumption and GPS data, we can assume that one car stop corresponds to one consumption if the consumption time falls within the car stop period at the same location.
But before that, we have to label GPS of car stops with specific locations.
All car stop locations are plotted on the map. And we can see there are many locations where the car stopped for over 6 hours (red dot on the map). Most of them are near the five parks (along the coast).
We are interested in car stops where credit card consumption happened, so we should exclude these stops which are very likely at home.
Besides, we notice that there are some long car stop near other locations. Those blue dots near “Ouzeri Elian” all belong to car 28, Isande. And he/she drives car very regular: stops at about 8:00 and leave at about 17:00.
It’s the same for car 9, Gustav. The car has many long stops near “Bean There Done That” (north-west area): stopped at about 17:00 and start moving on the second day at about 8:00. It seems that he lives here.
gps2_stop_long <- gps2_stop_sf %>%
filter(diff_mins >= 60*6)
gps2_stop_short <- gps2_stop_sf %>%
filter(diff_mins < 60*6)
map4 <- tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps2_stop_short) +
tm_dots(size = 0.1, alpha = 0.5) +
tm_shape(gps2_stop_long) +
tm_dots(col = "blue", size = 0.2, alpha = 0.3)
tmap_leaflet(map4)
Figure 7: Long and short car stop
After excluding the long on short car stop, We can find that these car stops are still messy in Figure 7. It’s difficult to distinctly group car stops and label with locations.
Many car stops locations might not correspond to any local business locations. And some car stops, which actually correspond consumption in one location, don’t have close GPS location. Furthermore, some locations are very near each other.
To fix this issue, we can find the most likely correspondence by their distinct car stop locations. For example, the multiple dots near the “Abila Airport”(southwest) should be a clear/distinct group. They are not single stops, close within one group and far from other dots&locations. Thus, We can believe that they correspond to the airport location with high confidence.
It’s the same for “Maximum Iron and Steel”(west), “Abila Scrapyard”(northwest), “Frank’s Fuel”(west), “Bean There Done That”(northwest), “Coffee Cameleon”(southeast), “Chostus Hotel”(northeast).
Some dots might not be a distinct group and there are several dot groups near one location. But some groups still can be labeled to one location with confidence, such as some dots along the street of the “Kronos Mart”(west). “Roberts and Sons”(west), “Desafio Golf Course”(northeast), “Albert’s Fine Clothing”(north), “Jack’s Magical Beans”(northeast), “Hallowed Grounds”(east) also have such dots groups.
After labeling these dots groups, we can match the credit card by the timestamp and location labels. And we can filter out some car owner who have matched pairs. Then, within one pair, we check the car stops of the car and consumption of the credit card to match more car stop with locations.
During this process, we repeat labeling confident car stop, filtering out pairs, checking new pairs and labeling new car stop until the new pairs is not confident or inconsistent with confident pairs.
Some of these confidently labeled car stops locations might be wrong. But they need to match the consumption in the to be the true GPS and consumption pair. This will decrease the error rate a lot since it’s much less possible that one specific car stop with a wrong location label can match one record in that location at the matched time period.
### confident pair
### cars&consumption with pairs
### match new pairs
###
Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships
1
[1] 1
1
[1] 1
1
[1] 1
Do you see evidence of suspicious activity? Identify 1- 10 locations where you believe the suspicious activity is occurring, and why.
1
[1] 1
1
[1] 1
1
[1] 1
trucks which were used for non-business issue
day*hour Now, let’s divide the units from days into hours:
cc_freq_day_hour <- as.data.frame(xtabs(~location++day+hour, data = cc))
cc_freq_day_hour$hour <- as.numeric(levels(cc_freq_day_hour$hour))[cc_freq_day_hour$hour]
p3 <- ggplot(cc_freq_day_hour,aes(x=hour,y=location))+
geom_tile(aes(fill=Freq),color="white")+
scale_fill_gradient(low = "#EFF7FB", high = "#0D2330")+
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.title=element_blank(),
plot.title = element_text(hjust=0.5))+
facet_wrap(~ day, ncol = 7)+
labs(title = "CC Frequency by hour of the day")
ggplotly(p3)
# knitr::kable(card_correspong_count,
# caption = ) %>%
# kableExtra::kable_paper("hover", full_width = F)